
Atomic move #496

Merged: 17 commits merged into v3.0.0 from atomic-move on Apr 16, 2017
Conversation

@manast (Member) commented Apr 13, 2017

Implements #480 and #190

@bradvogel (Contributor) commented Apr 15, 2017

Looks like there's a lot of dead code that can be removed now:
Job.prototype.move
Job.prototype.takeLock
Job.prototype.lockKey
Job.prototype.releaseLock
Job.prototype.delayIfNeeded
Job.prototype.moveToCompleted
scripts.move
all of bull-redlock

lots of others... feels good to clean it up!

@bradvogel (Contributor) left a comment

Really excited for this! I played around with it locally and it works well. My only concern is that with a very high number of workers (e.g. we have 12 workers at 50 concurrency each), each worker will use a lot more CPU now that it has to listen to global messages from other workers processing jobs. So for every single job processed, all workers need to call getNextJob at least once. But that's probably an OK tradeoff for now.
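For illustration, the fan-out being described looks roughly like this (a minimal sketch with assumed channel and key names, not Bull's actual code): every worker subscribes to the same global channel, so one publish per processed job wakes all workers, each of which races to claim the next job atomically.

var Redis = require('ioredis');

var sub = new Redis();    // subscriber connection (cannot issue regular commands)
var client = new Redis(); // separate connection for claiming jobs

sub.subscribe('bull:myqueue:added'); // hypothetical channel name
sub.on('message', function(channel, jobId) {
  // Every worker (e.g. 12 workers x 50 concurrency) receives this message;
  // each attempts an atomic claim, but only one wins and the rest no-op.
  client.rpoplpush('bull:myqueue:wait', 'bull:myqueue:active')
    .then(function(claimedId) {
      if (claimedId) {
        // this worker won the race; process claimedId
      }
      // else: another worker got there first. Cheap per message, but this
      // wake-everyone pattern is where the extra CPU goes.
    });
});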

lib/queue.js Outdated
@@ -426,7 +429,10 @@ interface JobOptions
   @param opts: JobOptions Options for this job.
 */
 Queue.prototype.add = function(name, data, opts){
-  return Job.create(this, name, data, opts);
+  var _this = this;
+  return this.isReady().then(function(){
Contributor:

good catch :)
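For context, the completed gated add presumably ends up looking something like this (a sketch reconstructed from the diff above, not necessarily the exact merged code):

Queue.prototype.add = function(name, data, opts){
  var _this = this;
  // wait until the queue's Redis connections are ready before adding
  return this.isReady().then(function(){
    return Job.create(_this, name, data, opts);
  });
};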

ARGV[3] lock duration in milliseconds
]]

local jobId = redis.call("LINDEX", KEYS[1], -1)
Contributor:

Nice! O(1) call to fetch the job.


Output:
0 OK
1 Missing key.
Contributor:

Should be -1?

@manast (author):

fixed.

if (lock){
_this.lock = lock;
}
Job.prototype.takeLock = function(){
Contributor:

This looks like it can be removed - right?

@manast (author):

It is only used by a unit test indirectly, will try to get rid of it.

Contributor:

Yeah, would be nice if the unit test implemented it itself. Just to keep the core queue implementation pure and simple.

Contributor:

I don't think takeLock should be tested, since it is meant to be a "private" function; as long as a public function that uses locks is tested, that is enough. A code coverage report for the unit tests should tell us whether takeLock is sufficiently exercised.

@manast (author):

yes, it is not tested directly, but it is used in a unit test where we need to have a lock: https://github.com/OptimalBits/bull/blob/v3.0.0/test/test_job.js#L100

end

--[[
Release lock:
Contributor:

What was this for?

@manast (author):

old stuff. removed now.

this.processJob = this.processJob.bind(this);
this.getJobFromId = Job.fromId.bind(null, this);
};

util.inherits(Queue, Disturbed);

Queue.ErrorMessages = errors.Messages;
Contributor:

Note this in the changelog?

lib/queue.js Outdated
@@ -284,11 +293,11 @@ Queue.prototype.isReady = function(){
}

 Queue.prototype.getJobMoveCount = function(){
-  return this.bclient.commandQueue.length;
+  return this.client.commandQueue.length;
 };

Queue.prototype.whenCurrentMoveFinished = function(){
Contributor:

What was this for originally? Seems like it can be removed to reduce complexity.

@manast (author):

I have removed it.

* We call these jobs 'stalled'. This is the most common case. We resolve these by moving them
* back to wait to be re-processed. To prevent jobs from cycling endlessly between active and wait
* (e.g. if the job handler keeps crashing), we limit the number of stalled job recoveries to MAX_STALLED_JOB_COUNT.

* Case B) The job was just moved to 'active' from 'wait' and the worker that moved it hasn't gotten
Contributor:

You might want to remove "This is the most common case" above, since it's now the only case.

@manast (author):

yes, this has been removed.
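As a rough illustration of the recovery rule in the stalled-jobs comment block above, the logic amounts to something like the following (an assumed, non-atomic JavaScript sketch with hypothetical key names; the real implementation runs atomically inside a Lua script):

var Redis = require('ioredis');
var client = new Redis();

var prefix = 'bull:myqueue:';  // assumed key layout
var MAX_STALLED_JOB_COUNT = 1; // assumed limit

function recoverStalledJob(jobId) {
  return client.exists(prefix + jobId + ':lock').then(function(locked) {
    if (locked) {
      return; // a live worker still holds the lock; not stalled
    }
    return client.hincrby(prefix + jobId, 'stalledCounter', 1).then(function(count) {
      var multi = client.multi().lrem(prefix + 'active', 0, jobId);
      if (count > MAX_STALLED_JOB_COUNT) {
        // stop the active <-> wait cycle: fail the job instead
        return multi.zadd(prefix + 'failed', Date.now(), jobId).exec();
      }
      // move it back to wait so another worker can re-process it
      return multi.rpush(prefix + 'wait', jobId).exec();
    });
  });
}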

lib/scripts.js Outdated
//
// TODO: DEPRECATE.
//
move: function(job, src, target){
Contributor:

I don't think this is used anymore - remove?

@manast (author):

Only used in a unit test; will remove it.

@manast (author) commented Apr 15, 2017

I pushed some updates addressing the code review. Regarding performance, I do not think it will be affected in any measurable way. If you have a busy queue, the published 'added' event will not be used, since the worker will just pick new jobs directly after processing the previous one. If the queue is not busy, the extra overhead of the publish will be negligible. Besides, in the previous blocking version we also published similar events, so the overhead was already there, and the blocking call timed out after 2.5 seconds, effectively making it a polling queue. Overall the queue is probably faster now; I will run some benchmarks soon. The last thing remaining is a guard timer and logic for handling the queue properly after a reconnection, and we need better unit tests for that.
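The busy-versus-idle behaviour described here can be sketched like this (assumed names, not the actual worker loop): while jobs keep coming, the worker fetches the next one directly and never touches pub/sub; it only parks on the 'added' event when a fetch comes back empty.

function processLoop(queue, handler) {
  return queue.getNextJob().then(function(job) {
    if (job) {
      // busy queue: process, then immediately fetch the next job directly
      return handler(job).then(function() {
        return processLoop(queue, handler);
      });
    }
    // idle queue: wait for a producer to publish an 'added' event
    return new Promise(function(resolve) {
      queue.once('added', resolve);
    }).then(function() {
      return processLoop(queue, handler);
    });
  });
}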

lib/queue.js Outdated
}, function(err){
_this.emit('error', err, 'failed to re-obtain lock before moving to failed, bailing');
});
// job.takeLock(true /* renew */, false /* ensureActive */).then( function(/*lock*/) {
Contributor:

Remove the commented-out line.

@bradvogel (Contributor):

Yes, makes sense. Nice work! Excited for this.

@manast (author) commented Apr 16, 2017

With these latest commits I feel ready to merge into the 3.0.0 branch, and I would like to release the first alpha version of the 3.x series.

@bradvogel (Contributor):

Looks good to me to merge.

@manast merged commit 328ecb7 into v3.0.0 on Apr 16, 2017
@xdc0 (Contributor) left a comment
Thanks for the great effort here @manast ! I know you already merged but hopefully you can take a look at the comments?

Thanks again !

end
end

redis.call("PUBLISH", KEYS[4], jobId)
Contributor:

Is the top documentation correct? It says that this will emit a waiting event, but it looks like it never emits a waiting event: it emits an added event regardless of whether the queue is paused or not.

@manast (author):

good catch. I will fix it.

back to wait to be re-processed. To prevent jobs from cycling endlessly between active and wait
(e.g. if the job handler keeps crashing), we limit the number of stalled job recoveries to MAX_STALLED_JOB_COUNT.

DEPRECATED CASE:
Contributor:

What does "DEPRECATED CASE" mean? It is not clear whether this is still an ongoing problem after this move. If it is still an issue, how do we know what exactly happened: a worker dropping jobs because it died, versus a worker failing to start processing the job due to this race condition?

@manast (author):

I will remove the comment. With single-instance Redis this case cannot happen anymore, since the lock is obtained atomically with the move. In the future, if we want to support Redis with replication, we will need to re-introduce redlock, but we will be able to take the lock before moving the job, so in practice the case should not happen either.

@@ -22,20 +26,15 @@ var Job = function(queue, name, data, opts){
name = '__default__';
}

opts = opts || {};
this.opts = _.extend({}, opts);
Contributor:

Minor thing, no need to address:

You can use _.defaults to simplify some code here (the options object must come before the defaults so user-supplied values win):

this.opts = _.defaults({}, opts, {
  attempts: 1,
  delay: 0,
  timestamp: Date.now()
});

this.delay = this.opts.delay;
this.timestamp = this.opts.timestamp;

Probably no need to reassign those to delay and timestamp since we're setting this.opts anyway?

@manast (author):

I will update it to use _.defaults since it is also semantically better. I feel options should always be kept in the opts field only, so delay should be in opts instead of being a property of the job. However, delay can also be defined when retrying a job, and in that case it would not be an option but a property used by the queue mechanics. Timestamp is also a bit special: normally it should be an internal property, but sometimes you want to override it, just like job.id.

attemptsMade: this.attemptsMade,
failedReason: this.failedReason,
stacktrace: this.stacktrace || null,
returnvalue: this.returnvalue || null
};
};

Job.prototype.toData = function(){
Contributor:

Minor:

Can probably be simplified with:

var whitelist = ['data', 'opts', 'stacktrace', 'returnvalue'];
JSON.stringify(_.pick(this.toJSON(), whitelist));

@manast (author):

yes, a nice simplification.



Output:
0 OK
-1 Missing key.
Contributor:

Looking at the Lua script here, -1 means both that the job is locked and that the key is missing.

@manast (author) commented Apr 18, 2017:

You are right. I added a -2 error code for when a lock is missing. I noticed, however, that these error codes are not being used; I need to fix that as well, since it may be hiding errors right now.
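A caller-side check of these codes might look like this (a hypothetical sketch; as noted above, the codes are not actually checked yet):

function interpretMoveResult(code) {
  switch (code) {
    case 0:
      return;                                // OK
    case -1:
      throw new Error('Missing key');        // the job key does not exist
    case -2:
      throw new Error('Missing lock');       // the lock was lost or never taken
    default:
      throw new Error('Unexpected script result: ' + code);
  }
}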


if redis.call("EXISTS", KEYS[3]) == 1 then -- // Make sure job exists
if ARGV[5] ~= "0" then
local lockKey = KEYS[3] .. ':lock'
Contributor:

I've noticed that other scripts, such as moveToDelayed, pause, and takeLock, don't check for lock existence, while other scripts, such as this one, do. We should be consistent about how we do lock checking.

I know that if these operations are executed it is implied that a lock exists, but even so, we've run into issues in the past due to workers stepping on each other's toes (hopefully this fixes some of those problems!), so I'd vote to always check for locks and have Bull complain loudly when this happens; that way these race conditions can be better understood and controlled.

@manast (author):

I will add an issue for such an improvement.
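The "always check the lock and complain loudly" idea could be sketched like this (assumed key naming following the jobKey .. ':lock' pattern from the snippet above, and assumed token handling; in practice the check would live inside the Lua script so it stays atomic with the state change):

function assertLock(client, jobKey, token) {
  return client.get(jobKey + ':lock').then(function(value) {
    if (value !== token) {
      // complain loudly so races are visible instead of silently ignored
      throw new Error('Lock missing or held by another worker for ' + jobKey);
    }
  });
}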


- job.jobId changed to job.id.
- refactored error messages into separate error module.
- completed and failed job states are now represented in ZSETs.
Contributor:

Could you please add the reasoning behind this move?

@manast (author):

  • job.jobId -> job.id to reduce redundancy.
  • Error messages refactored so that we can unit test without needing to copy/paste the messages.
  • ZSETs allow us to efficiently get ranges of completed and failed jobs, which is very useful for building UIs and scripts (see the example below).
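As an example of what the ZSET representation enables (key name assumed, loosely following Bull's bull:<queue>:<state> layout): fetching the most recently completed jobs for a UI becomes a single range query.

var Redis = require('ioredis');
var client = new Redis();

// 10 most recently completed job ids, newest first
client.zrevrange('bull:myqueue:completed', 0, 9).then(function(jobIds) {
  console.log('latest completed jobs:', jobIds);
});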

@xdc0 (Contributor) commented Apr 17, 2017

FWIW, with respect to the event-listener approach: it is actually not as CPU-intensive as a polling alternative. If anything, it can be a source of potential memory leaks when there is an unbounded number of event listeners; Node.js guards against this with a default limit of 10 listeners per event, to flag a potential leak. The EventEmitter class is actually pretty straightforward: https://github.com/nodejs/node/blob/master/lib/events.js
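The default cap mentioned here is per event name; a quick illustration using the standard Node API:

var EventEmitter = require('events').EventEmitter;

var emitter = new EventEmitter();
console.log(emitter.getMaxListeners()); // 10 by default
// Node prints a MaxListenersExceededWarning when an 11th listener is added
// to the same event; raise the cap when the fan-out is intentional:
emitter.setMaxListeners(100);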

@manast (author) commented Apr 18, 2017

@chuym thanks for the review. I will commit the fixes directly to the 3.0.0 branch.

@roggervalf roggervalf deleted the atomic-move branch May 12, 2024 00:16